SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization
Authors
Abstract
In this paper, we propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of large-scale machine learning models over a network. In SQuARM-SGD, each node performs a fixed number of local SGD steps using Nesterov's momentum and then sends sparsified and quantized updates to its neighbors, regulated by a locally computable triggering criterion. We provide convergence guarantees of our algorithm for general (non-convex) and convex smooth objectives, which, to the best of our knowledge, is the first theoretical analysis of compressed decentralized SGD with momentum updates. We show that the convergence rate of SQuARM-SGD matches that of vanilla SGD. We empirically show that including momentum updates in compressed decentralized methods can lead to better test performance than the current state-of-the-art, which does not consider momentum updates.
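At a high level, one round at a single node combines local Nesterov-momentum SGD steps with compressed, event-triggered communication to neighbors. The NumPy sketch below illustrates these ingredients under simplifying assumptions: the top-k and uniform-quantization compressors, the uniform mixing weights, the norm-based trigger, and all function names are illustrative stand-ins, not the paper's exact operators.

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude entries of a 1-D vector (illustrative compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def quantize(v, levels=16):
    """Uniformly quantize entries to a small set of levels (illustrative compressor)."""
    scale = np.max(np.abs(v)) + 1e-12
    return np.round(v / scale * levels) / levels * scale

def squarm_node_round(x, u, neighbor_updates, grad_fn, lr=0.01, beta=0.9,
                      local_steps=5, k=100, trigger=1e-3):
    """One communication round at a single node (simplified sketch, not the exact algorithm).

    x: local parameters (1-D array); u: Nesterov momentum buffer;
    neighbor_updates: compressed updates received from neighbors this round;
    grad_fn: stochastic gradient oracle (hypothetical user-supplied function).
    """
    x_ref = x.copy()
    # 1) A fixed number of local SGD steps with Nesterov's momentum.
    for _ in range(local_steps):
        g = grad_fn(x + beta * u)          # look-ahead (Nesterov) gradient
        u = beta * u - lr * g
        x = x + u
    # 2) Mix in neighbors' compressed updates (uniform weights for illustration).
    for upd in neighbor_updates:
        x = x + upd / len(neighbor_updates)
    # 3) Event-triggered communication: transmit a sparsified, quantized update
    #    only when the local change is large enough (locally computable test).
    delta = x - x_ref
    msg = quantize(topk_sparsify(delta, k)) if np.linalg.norm(delta) > trigger else None
    return x, u, msg
```

In the full algorithm each node also maintains copies of its neighbors' parameters so that only compressed differences need to be exchanged; the sketch collapses that bookkeeping for brevity.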
Similar Papers
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradient...
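The core idea behind QSGD-style quantization is unbiased stochastic rounding of each coordinate relative to the vector norm. A minimal sketch of that scheme, assuming a 1-D NumPy gradient; the function name and default level count are illustrative:

```python
import numpy as np

def qsgd_quantize(v, s=4, rng=None):
    """Unbiased stochastic quantization in the spirit of QSGD (illustrative sketch).

    Each coordinate is mapped to one of s levels of its magnitude relative to
    ||v||_2, rounding up or down at random so that E[Q(v)] = v.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm          # each entry lies in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # probability of rounding up
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s
```

Because the rounding is unbiased, the quantized gradients fit into standard SGD convergence analyses at the cost of a bounded increase in variance.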
Faster Asynchronous SGD
Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients. A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server. Approaches have been proposed to circumvent this problem that quantify staleness in terms of the number of elapsed updates. In th...
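One common generic remedy for staleness, shown here purely for illustration and not necessarily the rule proposed in that paper, is to damp each update by the number of elapsed server steps. The function and parameter names below are hypothetical:

```python
def staleness_damped_update(params, grad, lr, server_step, client_step):
    """Damp a stale gradient by its staleness before applying it (generic remedy).

    staleness = number of server updates applied since the client read the
    parameters it used to compute `grad`.
    """
    staleness = server_step - client_step
    effective_lr = lr / max(1, staleness)   # older gradients take smaller steps
    return params - effective_lr * grad
```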
Statistical inference using SGD
We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a prac...
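The averaging step that the abstract describes can be sketched as follows; the paper's exact scaling is omitted, and `stoch_grad` is a hypothetical user-supplied stochastic gradient oracle:

```python
import numpy as np

def averaged_sgd_sequence(theta0, stoch_grad, eta=0.05, burn_in=200, n_avg=1000, rng=None):
    """Average of fixed-step-size SGD iterates after a burn-in period (sketch)."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    total = np.zeros_like(theta)
    for t in range(burn_in + n_avg):
        theta = theta - eta * stoch_grad(theta, rng)  # one stochastic gradient step
        if t >= burn_in:
            total += theta
    return total / n_avg
```

Repeating this over independent runs yields approximately normal averages whose empirical spread can serve as the basis for confidence intervals.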
Generalized Byzantine-tolerant SGD
We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the ...
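For intuition, two classic Byzantine-robust aggregators are sketched below; these are generic examples of the aggregation-rule idea, not the three rules proposed in that paper:

```python
import numpy as np

def coordinate_wise_median(worker_grads):
    """Aggregate worker gradients by taking the median of each coordinate."""
    return np.median(np.stack(worker_grads), axis=0)

def trimmed_mean(worker_grads, trim=1):
    """Drop the `trim` largest and smallest values per coordinate, then average.

    Requires more than 2 * trim workers so that some values survive trimming.
    """
    stacked = np.sort(np.stack(worker_grads), axis=0)
    return stacked[trim:len(worker_grads) - trim].mean(axis=0)
```

Both rules bound the influence of any single manipulated gradient, which is the property a Byzantine-resilience proof typically exploits.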
Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is typically conducted with asynchronous stochastic optimization to maximize the rate of updates, at the cost of additional noise introduced from asynchrony. In contrast, the synchronous approach is often thought to be impractical due to idle time wasted on waiting for straggling workers. We revisit these conventional bel...
Journal
Journal title: IEEE Journal on Selected Areas in Information Theory
Year: 2021
ISSN: 2641-8770
DOI: https://doi.org/10.1109/jsait.2021.3103920